Methods for determining base locations in a polynucleotide

ABSTRACT

Disclosed are methods for polynucleotide sequencing that detect the location of selected nucleobases with greater precision. The methods can be used to determine the location and nature of modified bases in a polynucleotide, that is, non-canonical bases, or to improve the accuracy of sequencing in “problem” regions of DNA such as homopolymers, GC-rich areas, etc. The sequencing method exemplified is nanopore sequencing. Nanopore sequencing is used to generate a unique signal at a point in a polynucleotide sequence where an abasic site (AP site, or apurinic or apyrimidinic site) exists. As part of the method, an abasic site is specifically created enzymatically using a DNA glycosylase that recognizes a pre-determined nucleobase species and cleaves the N-glycosidic bond to release only that base, leaving an AP site in its place.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of Ser. No. 15/564,386 filed Oct. 4, 2017, which is a 371 National Phase of International Patent Application Serial No. PCT/US2016/026047 filed Apr. 5, 2016, which claims the benefit of U.S. Provisional Patent Application No. 62/143,585 filed on Apr. 6, 2015, which is hereby incorporated by reference in its entirety.

STATEMENT OF GOVERNMENTAL SUPPORT

This invention was made with government support under contracts HG006321 and HG00782 awarded by the National Institutes of Health, National Genome Research Institute. The government has certain rights in the invention.

REFERENCE TO SEQUENCE LISTING, COMPUTER PROGRAM, OR COMPACT DISK

The instant application contains a Sequence Listing which has been submitted electronically in ASCII format and is hereby incorporated by reference in its entirety. Said ASCII copy, created on Mar. 28, 2016, is named 482_41_PCT_seq_list.txt and is 3,234 bytes in size.

BACKGROUND OF THE INVENTION

Field of the Invention

The present invention relates to the field of polynucleotides (e.g. DNA, RNA) and the field of polynucleotide sequencing (e.g. nanopore sequencing).

Related Art

Presented below is background information on certain aspects of the present invention as they may relate to technical features referred to in the detailed description, but not necessarily described in detail. That is, individual compositions or methods used in the present invention may be described in greater detail in the publications and patents discussed below, which may provide further guidance to those skilled in the art for making or using certain aspects of the present invention as claimed. The discussion below should not be construed as an admission as to the relevance or the prior art effect of the patents or publications described.

BRIEF SUMMARY OF THE INVENTION

The following brief summary is not intended to include all features and aspects of the present invention, nor does it imply that the invention must include all features and aspects discussed in this summary.

The present invention in general concerns polynucleotide sequencing, wherein said sequencing can detect the identity and location of non-canonical bases with high accuracy, using the distinct characteristic of an abasic site within a polynucleotide sequence.

The present invention comprises, in certain aspects, a method for detecting a sequence of a polynucleotide molecule, comprising: (a) preparing a polynucleotide molecule which contains an abasic site; and (b) conducting single molecule sequencing on the polynucleotide molecule prepared in step (a), including determining a sequence that comprises the abasic site, whereby the abasic site is identified within the sequence, whereby the abasic site may be correlated with a nucleotide in a reference sequence.

The invention further comprises, with the foregoing, a method further comprising a step of treating the polynucleotide molecule with a reagent that specifically acts on a specific nucleobase species to create said abasic site. The nucleobase species may be one of A, T, G, C, 5-methyl-cytosine, U, or another nucleobase species as described below. The invention further comprises, with the foregoing, a method wherein the reagent is a glycosylase that removes the specific nucleobase species from a sugar in the polynucleotide backbone. The invention further comprises, with the foregoing, a method wherein said abasic site is correlated to one of (a) a non-canonical base; and (b) a location of an A, T, C, or G base in a sequence. The invention further comprises, with the foregoing, a method comprising a step of correlating an abasic site to a location within a homopolymeric stretch in said polynucleotide molecule. The invention further comprises, with the foregoing, a method comprising a step of correlating an abasic site to an epigenetic modification. The invention further comprises, with the foregoing, a method wherein said epigenetic modification is one of 5-methylcytosine (5mC), 5-hydroxymethylcytosine (5hmC), 5-formyl cytosine (5fC) and 5-carboxycytosine (5caC). The invention further comprises, with the foregoing, a method comprising the step of correlating an abasic site to DNA damage of DNA adducts, 8-oxo-guanine, alkylated bases, ethenoadducts, and thymine dimers.

The invention further comprises, with the foregoing methods, a method wherein the single molecule sequencing is nanopore-based sequencing that comprises measuring an ionic current that identifies an abasic site. The invention further comprises, with the foregoing, a method wherein the nanopore-based sequencing includes detecting an ionic current through a nanopore through which the polynucleotide passes.

In certain aspects, the present invention comprises a method for detecting a sequence in a polynucleotide, comprising: (a) treating a polynucleotide molecule with a glycosylase that creates an abasic site corresponding to a predetermined nucleobase species in the polynucleotide; (b) conducting sequencing on the polynucleotide prepared in step (a), wherein the sequencing indicates an abasic site within the polynucleotide sequence; and (c) using the sequence from step (b) to identify the abasic site and correlating said abasic site to said predetermined nucleobase.

The invention further comprises, with the foregoing, a method wherein the predetermined nucleobase species is one of uracil, 5-methylcytosine (m5C), 5,6-Dihydrouracil, 5-Hydroxymethylcytosine, hypoxanthine and xanthine. The invention further comprises, with the foregoing, a method wherein the polynucleotide molecule is RNA and the predetermined nucleobase species is one or more of pseudouridine (Ψ), dihydrouridine (D), inosine (I), and 7-methylguanosine (m7G). The invention further comprises, with the foregoing, a method further comprising the step of treating the polynucleotide with a reagent that modifies a specific predetermined nucleobase species to a different species of nucleobase by acting on the nucleobase, and then treating the polynucleotide molecule with said glycosylase, wherein said glycosylase acts on the different species of nucleobase. The invention further comprises, with the foregoing, a method wherein the reagent is 5-methylcytosine deaminase and the glycosylase is G/T(U)-mismatch DNA glycosylase. The invention further comprises, with the foregoing, a method wherein the reagent is KRuO₄ (potassium perruthenate) (to convert the 5-hydroxymethylcytosine bases to 5-formylcytosine). The invention further comprises, with the foregoing, a method wherein the glycosylase is one that lacks beta-lyase activity. The invention further comprises, with the foregoing, a method wherein the glycosylase is mutated to lack beta-lyase activity.

The invention further comprises, with the foregoing, a method wherein the glycosylase and substrate converted to an abasic site is as follows: uracil-DNA glycosylase (substrate being uracil, 5-fluorouracil, isodialuric acid, 5-hydroxyuracil, alloxan); G/T(U)mismatch-DNA glycosylase (G/G, A/G, T/C, T/U and U/C mismatches, uracil mismatch); alkylbase-DNA glycosylases (3-methyl guanine, O2-Alkylcytosine, 5-formyluracil, 5-hydroxymethyluracil, hypoxanthine, N6-ethenoadenine, N4-ethenocytosine, 7-chloroethyl-guanine, 3-Methyladenine, 8-oxoguanine); 5-methylcytosine-DNA glycosylase (T in G/T mismatch, and 5-methylcytosine thymine DNA glycosylase, 5-formylcytosine (5fC), 5-carboxylcytosine (5caC)); adenine-specific mismatch DNA glycosylases (in G/A and C/A, and 8-oxoguanine); DNA glycosylases removing oxidized pyrimidines (EndoIII-like) (5-hydroxycytosine, 5,6-Dihydrothymine, 5-Hydroxy-5,6-dihydrothymine, Thymine glycol, Uracil glycol, Alloxan, 5,6-Dihydroxyuracil, 5-Hydroxy-5,6-dihydroxyuracil, 5-Hydroxyuracil, 5-Hydroxyhydantoin, 2,5-Amino-5-formamidopyrimidine, 4,6-Diamino-5-formamidopyrimidine, 2,6-Diamino-4-hydroxy-5-formamidopyrimidine); EndoVIII (5,6-dihydrothymine, thymine glycol); EndoIX (urea); hydroxymethyl-DNA glycosylase (uracil, 5-hydroxymethyluracil); formyluracil-DNA glycosylase (5-formyluracil); DNA glycosylases removing oxidized purines (8-oxoguanine, 2,5-Amino-5-formamidopyrimidine, 4,6-Diamino-5-formamidopyrimidine, 2,6-Diamino-4-hydroxy-5-formamidopyrimidine, 8-oxoguanine (opposite T), 8-oxoguanine (opposite A)); and pyrimidine-dimer-DNA glycosylases (4,6-diamino-5-formamidopyrimidine, cyclobutane-pyrimidine dimer).

The present invention further comprises, in certain aspects, a method for preparing a naturally occurring polynucleotide for sequencing, said naturally occurring sequence comprising a non-canonical base in the polynucleotide, comprising the steps of treating the polynucleotide with a glycosylase enzyme to create an abasic site, and detecting the abasic site. The invention further comprises, with the foregoing, a method wherein the glycosylase is one of uracil-DNA glycosylase or 5-methylcytosine DNA glycosylase. The invention further comprises, with the foregoing, a method further comprising the step of treating the polynucleotide with an enzyme to modify a specific base species and then removing bases modified by the enzyme with said glycosylase enzyme.

The present invention further comprises, in certain aspects, a method for preparing a polynucleotide for sequencing, said polynucleotide comprising a sequence of canonical bases, comprising the step of modifying selected canonical bases with a glycosylase enzyme that removes the selected bases from the polynucleotide to create abasic sites within the polynucleotide.

The present invention further comprises, in certain aspects, a method for improving sequence accuracy in a naturally occurring polynucleotide, comprising: (a) conducting a first sequencing of the polynucleotide; (b) treating a copy of the polynucleotide to remove a portion of bases having a predetermined structure of one of A, T, C, or G; (c) conducting a second sequencing of a polynucleotide treated in step (b) and identifying abasic sites corresponding to said predetermined structure; and (d) comparing results of the first sequencing and the second sequencing and correlating abasic sites found in the second sequencing with predetermined sites in the first sequencing, said comparison providing improved sequence accuracy. The invention further comprises, with the foregoing, a method comprising converting a number of cytosine bases to 5-methylcytosine and removing the 5-methylcytosine with 5-methylcytosine glycosylase to create abasic sites at the locations of cytosines.

The present invention further comprises, in certain aspects, a kit for sequencing polynucleotides comprising a nanopore sequencing device, an enzyme selected from the group consisting of uracil-DNA glycosylase, G/T(U)mismatch-DNA glycosylase, alkylbase-DNA glycosylase, 5-methylcytosine-DNA glycosylase, thymine DNA glycosylase, adenine-specific mismatch DNA glycosylase, Fpg/Nei DNA glycosylase, Endonuclease III, Endonuclease VIII, Endonuclease IX, hydroxymethyl-DNA glycosylase, formyluracil-DNA glycosylase, formamidopyrimidine-DNA glycosylase, Fpg protein, Nei protein, and pyrimidine-dimer-DNA glycosylase; and instructions for detection of abasic sites by the nanopore sequencing device.

The present invention further comprises, in certain aspects, a computer program operative with a nanopore sequencing device for sequencing of polynucleotides containing abasic sites, comprising: (a) a look-up table with ionic current values for both (i) canonical base sequences and (ii) base sequences that include abasic sites at various positions in otherwise canonical base sequences; and (b) an algorithm applying the look-up table to translate ionic current values into a base sequence that includes the position of abasic sites in the sequenced DNA strand. The invention further comprises, with the foregoing, a computer program wherein the algorithm implements a hidden Markov model.
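By way of a non-limiting illustration, the look-up table and translation step described above may be sketched as follows. This is a minimal sketch under stated assumptions and is not the program of the invention: the current values in LOOKUP are hypothetical placeholders, the symbol "X" is an assumed marker for an abasic site, each current segment is assumed to report a single position rather than a group of bases, and a nearest-level rule stands in for the hidden Markov model.

```python
# Hypothetical mean ionic currents (pA) per symbol; "X" marks an abasic site.
# These numbers are illustrative placeholders, not measured values.
LOOKUP = {"A": 52.0, "C": 58.0, "G": 64.0, "T": 46.0, "X": 95.0}


def call_symbol(mean_pa):
    """Return the table symbol whose current level is closest to mean_pa."""
    return min(LOOKUP, key=lambda symbol: abs(LOOKUP[symbol] - mean_pa))


def decode(segment_means):
    """Translate segment mean currents into a sequence and abasic positions."""
    sequence = "".join(call_symbol(mean) for mean in segment_means)
    abasic_positions = [i for i, s in enumerate(sequence) if s == "X"]
    return sequence, abasic_positions


sequence, abasic_positions = decode([51.0, 59.0, 96.0, 45.0])
```

A practical table would be keyed on groups of bases and decoded with a model that accounts for overlap between successive groups; the sketch shows only how abasic entries can sit alongside canonical entries in one table.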

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A and 1B are a schematic diagram and a set of traces, respectively, illustrating the process of sequencing a polynucleotide using abasic sites. FIG. 1A is a schematic diagram showing a polynucleotide having sites 1 and 3 occupied by specific nucleobases (e.g. A, T, G, C), while site 2 in the polynucleotide contains an abasic site. A predetermined nucleobase species has been removed as shown at 102. In some cases, such as analyzing homopolymeric stretches (e.g. 5, 10, 20 etc. of the same repeating base species, such as poly T or poly C), some, but not necessarily all, of the predetermined nucleobases will be removed. As shown at 104, FIG. 1A bottom panel, the polynucleotide is analyzed using a nanopore-based sequencing method. A nanopore-based sequencing procedure is used, where bases are sequentially interrogated, individually or in groups. A sequence signal, such as a difference in current through a pore, electrical property or the like, is generated based on the sequence. The sequence signal is analyzed by appropriate logic means (hardware or software processing signals from the detector) to produce a distinct signal for individual bases, wherein the location of the abasic site replacing the modified base is determined, along with canonical bases.

FIG. 1B illustrates a set of traces showing the ionic current difference between cytosine, 5-methylcytosine, and an abasic site in a nanopore-based sequencer. Shown are ionic currents for 3 different DNA polynucleotides of the sequence -TTTTTTTTTC5mCGGTTTTTTTTTCCGGTTTTTTTTCabasicGGTTTTTTTTT (SEQ ID NO: 1). Of note is the consistent ~20 pA increase in ionic current when either a cytosine or a 5-methylcytosine is converted to an abasic position. Each of Events 1, 2 and 3 (top panel, middle panel and bottom panel of FIG. 1B) represents a single DNA strand analyzed in the nanopore device.

FIG. 2 is a schematic diagram showing steps that illustrate embodiments of the present process using a double stranded DNA with a polynucleotide having standard nucleobases A, T, G, C, and a non-standard base X, which may be as listed here, or in Table 1. X may include epigenetic markers on genomic DNA (e.g. C5-methylcytosine or mC) and various forms of DNA damage, including DNA adducts (drugs or carcinogenic pollutants), mitochondrial genome damage (8-oxo-guanine), base analogs (AIDS and anti-cancer drugs), UV damaged DNA (thymine dimers), and ethenolated bases that include: 1,N⁶-alpha-hydroxyethenoadenine; 1,N⁶-ethenoadenine; 3,N⁴-ethenocytosine; 1,N²-ethenoguanine; and N²,3-ethenoguanine.

In FIG. 2 step 1, the non-standard base X is converted into an abasic site by the reaction of the polynucleotide with a specific DNA glycosylase, such as listed here and further listed in Table 1. Specific DNA glycosylases include, but are not limited to, 5-methylcytosine DNA glycosylase, uracil DNA glycosylase, G/T(U) mismatch DNA glycosylase, alkylbase DNA glycosylases, DNA glycosylases removing oxidized pyrimidines, DNA glycosylases removing oxidized purines (e.g. endonuclease III (Endo III) and formamidopyrimidine-N-glycosylase [Fpg]), hydroxymethyl-DNA glycosylase, formyluracil-DNA glycosylase, and pyrimidine-dimer-DNA glycosylases. The example sequences shown have the sequences ACCCGTAACCGTAXTCGCGTAAA (strand to be sequenced; SEQ ID NO: 2), TTTACGCGAATACGGTTACGGGT (complementary strand; SEQ ID NO: 3), and ACCCGTAACCGTAabasicTCGCGTAAA (strand to be sequenced, with abasic site; SEQ ID NO: 4). In step 2, the sequence of the polynucleotide having the abasic sites is detected by a nanopore sequencing device, exemplified by a MINION™ sequencing device, made by Oxford Nanopore Technologies, Oxford, UK. Such a device and its use are described, e.g. in “Improved data analysis for the MINION™ nanopore sequencer,” Jain et al. Nature Methods 12:351-356 (2015) and in “Poretools: a toolkit for analyzing nanopore sequence data.” Nicholas J. Loman & Aaron R. Quinlan. Bioinformatics, 30(23) 3399-3401 (2014). The bottom panel shows high current abasic sites.

FIG. 3 is a pair of panels that show raw data from single control Lambda DNA molecules with no dUTP or glycosylase (top panel) and single Lambda DNA molecules after dUTP incorporation and glycosylase treatment creating abasic sites (bottom panel). The raw data (current traces) can be obtained as follows:

Step A: Beginning with native Lambda DNA, use PCR to copy Lambda DNA with 4 canonical dNTPs and dUTP (top panel).
Step B: The copied DNA contains the 4 canonical bases, with U replacing T at randomly distributed positions.
Step C: Treat the uracil-containing DNA with uracil DNA glycosylase (UDG).
Step D: UDG creates abasic sites in the copied DNA at positions previously containing uracil.
Step E: Raw data obtained from nanopore sequencing of DNA from steps A and D show current spikes at abasic positions (bottom panel).
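Steps A through D may be mimicked in silico to see where abasic sites would be expected to fall. The following is a minimal sketch under stated assumptions: the function names, the symbol "X" for an abasic site, and the substitution fraction are all hypothetical, and random substitution is only a stand-in for polymerase incorporation of dUTP.

```python
import random


def incorporate_u(sequence, fraction, rng):
    """Replace each T with U with probability `fraction` (steps A and B)."""
    return "".join(
        "U" if base == "T" and rng.random() < fraction else base
        for base in sequence
    )


def udg_treat(sequence):
    """Convert every U into an abasic site, written here as X (steps C and D)."""
    return sequence.replace("U", "X")


rng = random.Random(0)          # seeded so that a run can be repeated
template = "ACGTTTTGCATT"       # arbitrary illustrative sequence
copied = incorporate_u(template, 0.5, rng)
treated = udg_treat(copied)
```

In this sketch the treated strand keeps the length of the template, and an X can appear only at a position that held a T in the template.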

Thus, strand replication of a duplex molecule of Lambda DNA (DNA from Enterobacteria phage λ) by a DNA polymerase generates a copy of the original DNA strands which contains the four canonical bases A, C, G, T and, in random positions, has replacement of T with U (in the copy of the original DNA strand) (step B). The uracil-containing DNA is treated with UDG (uracil-DNA glycosylase) to produce abasic sites at positions of U incorporation into the DNA. Uracil DNA glycosylase enzyme is reacted with the polynucleotide to remove the uracil base, leaving the sugar phosphate backbone intact in the polynucleotide, but producing an abasic site at positions where the uracil had been incorporated into the DNA strand (Step D). In the panels of FIG. 3, raw data (nanopore current traces) from nanopore sequencing of DNA is shown in the top panel. Abasic sites (bottom panel) are identified by high current segments in the nanopore sequence data shown, where the horizontal line (“dotted line”, bottom panel at about 90 pA) indicates current values in excess of those recorded for canonical DNA bases, i.e. current segments above the dotted line mark positions of abasic sites. The current indicating an abasic site is increased above the current peaks observed for corresponding canonical bases and, in this example, produces peaks greater than about 90 pA. Current segments above the “dotted line” thus mark abasic positions within the polynucleotide sequence. The current traces shown in FIG. 3 (top and bottom panels) also show the location of a sequence loading adaptor and a hairpin adaptor. These are oligonucleotides added to the DNA to be analyzed, according to a known nanopore protocol. These adaptors are used in preparation of the DNA before sequencing on the nanopore sequencer. The sequence loading adaptor is used in enzymatic loading of the DNA molecule into a nanopore for sequencing, and the hairpin adaptor permits contiguous sequencing of both the template and complement strands by connection into a single strand, as illustrated schematically in FIGS. 7C and 7D. The raw data in FIG. 3 is interpreted by software specific for the nanopore-based sequencing device to generate a DNA sequence indicating the location of the abasic sites.
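The use of a reference line at about 90 pA to mark abasic positions may be illustrated by a simple thresholding routine. This is a minimal sketch, assuming the trace has already been reduced to a list of segment mean currents; the function name and the example values are hypothetical, and the software supplied with a nanopore device is not represented here.

```python
def abasic_runs(segment_means_pa, threshold_pa=90.0):
    """Return (start, end) index pairs, end exclusive, of consecutive
    segments whose mean current exceeds the threshold."""
    runs = []
    start = None
    for index, mean in enumerate(segment_means_pa):
        if mean > threshold_pa:
            if start is None:
                start = index          # a high-current run begins here
        elif start is not None:
            runs.append((start, index))
            start = None
    if start is not None:              # the trace ended inside a run
        runs.append((start, len(segment_means_pa)))
    return runs


# Illustrative segment means in pA; not measured data.
example_runs = abasic_runs([60.0, 72.0, 95.0, 101.0, 65.0, 58.0, 110.0])
```

Grouping neighbouring high-current segments into one run reflects the observation that a single abasic site can raise more than one successive current segment as it moves through the pore.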

FIGS. 4A, 4B, 4C, and 4D are a series of graphs that show the distributions of mean currents (from 0 to 250 pA) for segmented current data used for base sequence determination in a nanopore sequencer (i.e. the MINION™ nanopore sequencer). Mean current levels derived from segmented current data are characteristic of specific bases in a polynucleotide strand as they transit through the nanopore. These current levels sequentially change in a characteristic pattern determined by the sequence of the polynucleotide, and are used to determine DNA base sequence in nanopore sequencing. For the MINION™ nanopore sequencer, bases are processed through the nanopore sequentially one base at a time; however, the sequencer detects 5 bases proximate to the aperture of the nanopore, resulting in 1024 different current levels corresponding to all combinations of 5 bases. FIGS. 4A, 4B, 4C and 4D show the distribution of currents for 4 sequencing runs, one for each of 4 different populations of duplex Lambda DNA molecules from Enterobacteria phage λ. The Y axis for FIGS. 4A-4D is expressed as the percent of the total population of current segments from that sequencing run. FIG. 4A is the distribution of currents for Lambda DNA containing the 4 canonical bases. FIG. 4B is the distribution of current segments for Lambda DNA containing the 4 canonical bases (ACGT) and with Uracil substituted for T at random positions. FIG. 4C is the distribution of currents for Lambda DNA containing the 4 canonical bases then treated with DNA Uracil Glycosylase. FIG. 4D is the distribution of currents for Lambda DNA containing the 4 canonical bases (ACGT) with Uracil substituted for T at random positions and treated with DNA Uracil Glycosylase. Abasic positions are associated with a high current relative to positions within a DNA strand that contain a canonical base. Only the density profile shown in 4D is expected to contain abasic sites. Note the peak at high current levels, revealed as a spike on the shoulder of the current level distribution (circled) in the vicinity of 80 pA, and its absence in 4A, 4B and 4C. This is direct evidence of the detection of abasic positions in the polynucleotide populations examined.
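The figure of 1024 current levels follows from counting every ordered combination of 4 bases over 5 positions (4 to the 5th power). The following minimal sketch reproduces that count and, under the assumption that an abasic site is treated as a fifth symbol (written here as "X", a hypothetical notation), shows how many additional 5-base groups would contain at least one abasic position.

```python
from itertools import product


def kmers(alphabet, k=5):
    """Enumerate every ordered group of k symbols drawn from the alphabet."""
    return ["".join(group) for group in product(alphabet, repeat=k)]


canonical = kmers("ACGT")                      # 4**5 groups of canonical bases
with_abasic = kmers("ACGTX")                   # 5**5 groups when X is allowed
abasic_only = [group for group in with_abasic if "X" in group]

print(len(canonical), len(with_abasic), len(abasic_only))
```

Under this assumption, a table covering abasic positions would need entries for the groups in abasic_only in addition to the canonical entries.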

FIGS. 5A and 5B are a pair of current traces from control DNA and DNA where U bases have been removed by a uracil-DNA glycosylase. This figure shows MINION™ nanopore current data for a control DNA strand (FIG. 5A) and an abasic site-containing DNA strand (FIG. 5B) obtained from nanopore sequencing runs. Comparison of control Lambda DNA (5A) and glycosylase treated DNA (5B) shows current spikes above the reference line (dotted line) at about 90 pA shown in FIGS. 5A and 5B. The current segments above 90 pA are from abasic sites created by removal of uracil from the polynucleotide by uracil-DNA glycosylase.

FIG. 6 is a schematic diagram that shows two approaches to a procedure for detecting high current signals at abasic sites (arrows at bottom panel) in a nanopore (MINION™) sequencing device. Abasic sites are those where C5-methylcytosine, having been prepared with a G base on an opposite strand, is removed, thereby identifying the C5-methylcytosine in the sample. In Approach I, a representative dsDNA molecule containing a C5-methylcytosine at a random location is treated with 5-methylcytosine DNA glycosylase to produce an abasic site. In Approach II, a molecule such as in Approach I is treated with 5-methylcytosine deaminase to convert the C5-methylcytosine to a T, creating a T/G mismatch. This molecule is then treated with G/T(U)-mismatch DNA glycosylase to create the desired abasic site. As shown in the bottom panel, abasic sites can be detected by increased current.

FIG. 7A is a schematic representation of a process to improve the accuracy of homopolymeric runs of T's and A's in a DNA molecule being sequenced. In a first step, DNA polymerase and a mixture containing U is used to replace a fraction of T's in the sequence with U's, giving DNA duplexes such as shown on the left of the figure. In a second step, uracil DNA glycosylase is reacted with the product of the first step and used to create abasic sites, giving DNA duplexes such as shown on the right of the figure. Then, the abasic sites are identified as described above, using nanopore-based sequencing. The abasic site-containing sequences obtained are aligned using bioinformatics tools. Positions of abasic sites in the alignment are determined and counted. 5-mers with altered current profiles due to abasic positions within the 5-mer are determined. This information is used to determine the length of the homopolymeric run of T's. Alternatively, a DNA glycosylase engineered to remove T can be used to effect the same analysis in fewer enzymatic steps.
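The counting of abasic positions across aligned reads may be sketched as follows. This is a minimal sketch under strong assumptions and is not the bioinformatics procedure of the figure: the reads are assumed to be already aligned into a shared column coordinate system, abasic sites are written as "X" (a hypothetical notation), and the longest contiguous stretch of columns carrying an abasic site in at least one read is taken as a lower-bound estimate of the homopolymer length.

```python
from collections import Counter


def abasic_column_counts(aligned_reads):
    """Count, for each alignment column, how many reads show an abasic site."""
    counts = Counter()
    for read in aligned_reads:
        for column, symbol in enumerate(read):
            if symbol == "X":
                counts[column] += 1
    return counts


def longest_covered_run(counts):
    """Length of the longest stretch of adjacent columns present in counts."""
    best = current = 0
    previous = None
    for column in sorted(counts):
        adjacent = previous is not None and column == previous + 1
        current = current + 1 if adjacent else 1
        best = max(best, current)
        previous = column
    return best


# Three illustrative reads of the same strand, each with a different
# subset of the T positions converted to abasic sites.
reads = ["ACXTTTG", "ACTXTXG", "ACTTXTG"]
counts = abasic_column_counts(reads)
run_length = longest_covered_run(counts)
```

With few reads, some positions in the run may never carry an abasic site, so the estimate can only undercount; more reads with independently placed abasic sites tighten it.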

FIG. 7B shows an approach for detecting repetitive (homopolymeric) sequences of C. Using a DNA glycosylase engineered to remove C, a similar analysis to that for homopolymeric tracts of T can be completed for homopolymeric tracts of C.

FIGS. 7C and 7D show an approach for detecting certain bases that can be replaced with bases that are glycosylase substrates. Also shown is a dsDNA prepared with a hairpin to enable individual sequencing of both strands of the dsDNA.

FIG. 8 is a schematic diagram that shows an improved method of sequencing RNA using the present method, where abasic sites are created. In step 1, reverse transcriptase is used to create a DNA-RNA heteroduplex. The provided nucleotides for polymerization include U to be incorporated complementary to the A in the RNA. In step 2, DNA glycosylase is used to remove the U residues inserted in step 1, e.g. with uracil DNA glycosylase, to create abasic sites that are detected in step 3 by a MINION™ nanopore-based sequencing device.

FIG. 9 shows structures of cytosine and modified cytosines, which are the subject of a process for a combination of different enzymatic treatments of a polynucleotide that will detect different base modifications in a single sequence in a sample, as described below. The method detects and distinguishes cytosine, 5-methylcytosine, 5-hydroxymethylcytosine, 5-formylcytosine and 5-carboxycytosine. In each case the cytosine species is treated with a glycosylase specific for Cyt or the non-canonical base of interest (5mC, 5HmC, 5fC, or 5caC) to generate an abasic site at the specific location of the species of interest. If an appropriate glycosylase is available, a two-step process is carried out. Alternatively, in the first step, the cytosine species is enzymatically converted to another base that can be removed to form the abasic site.

FIG. 10 is a schematic representation of a sequencing process wherein modified bases are detected by preparation of a complementary strand to the template strand. The complementary strand is prepared with the canonical bases. The modified bases are excised as described above, and the sequence will have information both of the abasic site and the complement of the original modified base.

DETAILED DESCRIPTION

Definitions

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by those of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred methods and materials are described. Generally, nomenclatures utilized in connection with, and techniques of, cell and molecular biology and chemistry are those well-known and commonly used in the art. Certain experimental techniques, not specifically defined, are generally performed according to conventional methods well known in the art and as described in various general and more specific references that are cited and discussed throughout the present specification. For purposes of clarity, the following terms are defined below.

Ranges: For conciseness, any range set forth is intended to include any sub-range within the stated range, unless otherwise stated. As a non-limiting example, a range of 120 to 250 is intended to include a range of 120-121, 120-130, 200-225, 121-250 etc. The term “about” has its ordinary meaning of approximately and may be determined in context by experimental variability. In case of doubt, the term “about” means plus or minus 5% of a stated numerical value.
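As a worked illustration of the fallback meaning of “about”, the plus or minus 5% interval around a stated value can be computed directly. This is a minimal sketch; the function names are hypothetical, and the example value of 90 is chosen only because a level of about 90 pA is referred to elsewhere in this description.

```python
def about(value, tolerance=0.05):
    """Return the (low, high) interval of plus or minus 5% around value."""
    delta = abs(value) * tolerance
    return value - delta, value + delta


def is_about(measured, stated, tolerance=0.05):
    """True if measured lies within the tolerance interval around stated."""
    low, high = about(stated, tolerance)
    return low <= measured <= high


low, high = about(90.0)   # roughly 85.5 to 94.5
```

Under this reading, a measured value of 88 would fall within “about 90”, while a value of 80 would not.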

The term “abasic” or “abasic site” is used in its conventional sense. An abasic site is also known as an AP site, or apurinic or apyrimidinic site. In a DNA or RNA strand, an abasic site is one in which the base is not present, but the sugar phosphate backbone remains intact. Abasic sites may exist in various tautomeric forms within a polynucleotide. For details, see Krotz et al., U.S. Pat. No. 6,586,586, “Purification of Oligonucleotides,” issued Jul. 1, 2003. As described there, an abasic site may comprise a mixture of four chemical species in a tautomeric equilibrium. For example, an abasic site can be an apurinic or apyrimidinic site located on an oligonucleotide, wherein an aldehyde moiety is present. For example, this is shown in FIG. 4 of U.S. Pat. No. 6,586,586.

The term “modified base” refers to a nucleobase in a polynucleotide (DNA or RNA) which is not one of the 5 canonical standard bases, i.e. A, C, G, and T and, in RNA, U. U (uracil) may be regarded as a modified base in DNA and a canonical base in RNA. Using conventional notation, in all sequences here, “A” stands for adenine, “G” stands for guanine, “C” for cytosine, and “T” for thymine. Adenine always pairs with thymine. Cytosine always pairs with guanine.

The term “predetermined nucleobase species” accordingly refers to a selected species such as A, C, G, T, or U as canonical base sequences, or a species that is modified as described herein, such as methylcytosine (e.g. 5-methylcytosine), 5-fluorouracil, 3-methyladenine, 5-carboxycytosine, 8-oxoguanine, etc. as set forth further in Table 1.

The term “polynucleotide” is used in its conventional sense to include polynucleotides that are human DNA, human RNA, human cDNA and counterparts in other organisms including plants, microorganisms and viruses. The polynucleotides used here typically comprise two strands of a DNA molecule that occur in an antiparallel orientation, where one strand is positioned in the 5′ to 3′ direction, and the other strand is positioned in the 3′ to 5′ direction. The terms 5′ and 3′, as is conventional, refer to the directionality of the DNA backbone, and are critical to describing the order of the bases. The convention for describing base order in a DNA sequence uses the 5′ to 3′ direction, and is written from left to right. The term polynucleotide includes oligonucleotides, and in general, is formed from a plurality of joined nucleotide units, including linear sequences of nucleotides, in which the 5′ linked phosphate or other internucleotide linkage on one sugar group is covalently linked to either the 2′-, 3′-, or 4′-position on the adjacent sugars. Also included within the definition of polynucleotide as an “oligonucleotide” are other double stranded oligonucleotides including DNA, RNA and plasmids, vectors and the like. Thus, the term “oligonucleotide” includes linear sequences having 2 or more nucleotides, and any variety of natural and non-natural constituents as described below.

The term “reference sequence” refers to a known sequence that corresponds to a sequence obtained during a sequencing method being carried out. A sequence corresponds to a sequence of interest if it may be presumed to have a high degree of sequence identity (>90%) to the sequence under study within the region of interest. A reference sequence may be a sequence determined by sequencing a nucleotide in the same sample as a sequence under study, or it may be a sequence obtained from a database of known sequences. A reference sequence may be obtained by comparison to, for example, a sequence obtained from the UCSC Genome Browser.

The term “nanopore-based sequencing”, as used herein, means a process for determining the order in which specific nucleotides occur on a strand of polynucleotide, based on a physical interrogation of monomers in a single strand at a time. Individual monomers may be identified one-by-one, in unique groups (e.g. 5-mer), or otherwise uniquely identified by their structural characteristics.

The term “nanopore-based sequencing method” further refers to use of a physical structure in the form of a nanopore or equivalent structure. A nanopore itself is simply a small hole, of the order of 1 nanometer in internal diameter, through a thin film through which a polynucleotide being sequenced passes. The theory behind nanopore sequencing is that when a nanopore is immersed in a conducting fluid and a potential (voltage) is applied across it, an electric current due to conduction of ions through the nanopore can be observed. Nanopore sequencing devices are described, e.g. in Schneider & Dekker, “DNA Sequencing with nanopores,” Nature Biotech. 30 326-328 (2012), which is incorporated for further descriptions of nanopore systems, enzymes and pores used. See also Akeson et al. “Methods and apparatus for characterizing polynucleotides,” U.S. Pat. No. 7,238,485; Peng et al., “Electron beam sculpting of tunneling junction for nanopore DNA sequencing,” U.S. Pat. No. 8,858,6764; and Ju “DNA sequencing by nanopore using modified nucleotides,” U.S. Pat. No. 8,889,348. In connection with the latter patent, it will be described herein that the modifications of specific, pre-selected bases are carried out and distinguished from the canonical bases. An enzyme (e.g. a DNA polymerase or other transposase) is used to modulate the passage of the polynucleotide through the nanopore.

As also described in the preceding referenced patent, the present methods may also detect an electronic signature, where an “electronic signature” of a nucleotide passing through a pore via application of an electric field shall include, for example, the duration of the nucleotide's passage through the pore together with the observed amplitude of current during that passage. Electronic signatures can be visualized, for example, by a plot of current (e.g. pA, “picoamperes”) versus time. An electronic signature for a DNA is also envisioned and can be, for example, a plot of current (e.g. pA) versus time for the DNA to pass through the pore via application of an electric field.

Another embodiment of nanopore sequencing is nanopore sequencing with current detection, optical detection, or physical molecule (magnetic) extension (Ding et al., “Single-molecule mechanical identification and sequencing,” Nature Methods 9, 367-372 (2012)).

Nanopore-based sequencing may, in some embodiments, employ a base-by-base interrogation of a single polynucleotide molecule. Methods have been used for nanopore physically based sequencing that use, for example, electron tunneling. Measurement of electron tunneling through bases as ssDNA translocates through the nanopore may be used. Most research has focused on proving bases could be determined using electron tunneling. These studies were conducted using a scanning probe microscope as the sensing electrode, and have proved that bases can be identified by specific tunneling currents (Chang, S; Huang, S; He, J; Liang, F; Zhang, P; Li, S; Chen, X; Sankey, O; Lindsay, S (2010). “Electronic signatures of all four DNA nucleosides in a tunneling gap”. Nano Lett. 10: 1070-1075).

In the present nanopore-based sequencing, no labelling of the nucleotides or biologic functionality is required to determine the sequence during the nanopore-based sequencing step. The term “nanopore-based sequencing method” includes, but is not limited to, “nanopore sequencing,” which is a nanopore-based sequencing method exemplified herein.

The term “RNA glycosylase” refers to an enzyme that catalyzes the hydrolysis of N-glycosidic bonds in an RNA molecule. This includes EC 3.2.2.22. References here to DNA glycosylases may be applied to the use of RNA glycosylases on RNA.

The term “DNA glycosylase” refers to a family of enzymes that remove bases from DNA. Typically such enzymes are involved in base excision repair, classified under EC number EC 3.2.2. Based on structural similarity, glycosylases are grouped into four structural superfamilies. The UDG and AAG families contain small, compact glycosylases, whereas the MutM/Fpg and HhH-GPD families comprise larger enzymes with multiple domains.

Another DNA glycosylase used here is uracil-DNA glycosylase, which excises uracil from dU-containing DNA by cleaving the N-glycosidic bond between the uracil base and the sugar backbone. This cleavage generates alkali sensitive apyrimidinic sites that are blocked from replication by DNA polymerase or prevented from becoming a hybridization site. This glycosylase is also referred to here for convenience as a DNA glycosylase. Alternative glycosylases used here include glycosylase enzymes engineered to remove beta-lyase activity for the purpose of using the glycosylase to create abasic sites for detection of modified bases with a nanopore DNA sequencer.

Alternative glycosylases useful here include glycosylase enzymes engineered to have altered base specificity for the purpose of using the glycosylase to create abasic sites for improvement of nanopore DNA sequencer accuracy and performance. Examples of modified human uracil DNA glycosylase (Gene Symbol UNG) include conversion of Tyr147 to Ala147, resulting in activity of the glycosylase cleaving both uracil and thymine. Similarly, changing Asn204 to Asp204 results in this glycosylase cleaving both uracil and cytosine. For details, see Kavli, B; Slupphaug, G; Mol, CD; et al. “Excision of cytosine and thymine from DNA by mutants of human uracil-DNA glycosylase,” EMBO JOURNAL Volume: 15 Issue: 13 Pages: 3442-3447 Published: Jul. 1, 1996.

The term “epigenetic modification” is used to refer to an alteration in a DNA sequence that changes a base in the sequence from a canonical base to another chemical species. Examples include 5-methyl cytosine, 5-hydroxymethyl cytosine, 5-formyl cytosine, 5-carboxy cytosine, etc.

The term “single molecule sequencing” refers to a sequencing method that is not carried out on a population of molecules amplified from a sample. Several art-recognized methods of single-molecule sequencing have been developed (see U.S. patent application US2006000400730 and U.S. Pat. Nos. 7,169,560; 6,221,592; 6,905,586; 6,524,829; 6,242,193; and 6,136,543). Commercial examples include the Oxford Nanopore Technology MINION™ and GRIDION™ devices, the Helicos Biosciences Corporation HELISCOPE™, the SMRT sequencing method from Pacific Biosciences of California, Inc., etc.

General Method and Apparatus

The present nanopore-based sequencing process is preferably carried out in a high-throughput device, and employs computer technology to measure the physical parameters associated with translocation of the bases in a polynucleotide through a nanopore, e.g. ionic current blockade as bases occlude the nanopore during translocation. As is known, the device will contain logic devices for analyzing the raw sequence and matching it to specific bases in a sequence. As explained below, an important physical characteristic of abasic sites is the lack of a base in a moiety in the polynucleotide, as shown in FIG. 1A.

Detection of polynucleotide base sequence with the presently exemplified nanopore sequencing relies on detection of changes in ionic current as a polynucleotide passes through the nanopore. The electrical signature (ionic current) from an abasic site passing through a nanopore is increased and distinguishable from any base-containing site. It is understood that the polynucleotide used in the nanopore physically based sequencing will be prepared for sequencing by various processes, and will be acted upon by various chemical or biological agents, such as enzymes controlling translocation of the polynucleotide, but the detection system makes the distinction shown in FIG. 1A by directly detecting the nature of the bases, and particularly the lack of a base. The present methods may be used with existing sequencing hardware and software, and the sequence signal obtained can be interpreted by software routines that call the nucleotide at sites 1 and 3 as per preprogrammed base calling, and also recognize site 2 as abasic. A schematic representation of a sequence signal is shown at the bottom of FIG. 1A. Although site 2 here is a distinctly higher peak, alternative hardware and software could characterize site 2 in other ways to distinguish it from canonical bases at site 1 and site 3. Nanopore electrical current signals from abasic sites (no base, but intact sugar phosphate backbone) within DNA strands are significantly different from electrical current signals for adenine, cytosine, guanine and thymine bases, as well as modified bases, and are reliably detected. This invention determines the specific location of a modified DNA base, e.g. 5-methylcytosine (mC), by using modified base specific DNA glycosylases to create abasic sites within DNA duplex molecules at the site of the modified base. The newly created abasic positions within the DNA strand are then detected by a nanopore sequencer such as the MinION™ nanopore sequencer with high precision and recall.
This invention can be used to detect a wide spectrum of modified DNA and RNA bases as well as improve sequencing accuracy. Applications of this invention include epigenetic sequencing with the MinION or other nanopore sequencer and detection of cancer chemotherapeutic agents, chemical mutagens and carcinogens bound to DNA.

The present methods comprise selecting the modified base (or set of bases) that will be determined in a particular embodiment. If there is a modified base in a sequence (e.g. DNA double helix), one first determines if the modified base to be detected is a known and specific substrate for a glycosylase. If it is, the modified base (in the polynucleotide strand being characterized) is treated directly with the appropriate glycosylase to generate abasic sites where the modified base was located in the strand. The present methods may comprise a preparation where a polynucleotide, e.g. cDNA, genomic DNA, mRNA, genomic RNA, etc., may be analyzed for the presence of a subset of bases. The bases of interest may be modified bases (i.e. modifications of canonical A, T, C, G, or U) wherein the modified base of interest is present in the polynucleotide in the form of a substrate for a glycosylase. If the modified base of interest is not a substrate for a known glycosylase, the modified base is treated to convert it into a nucleobase that is a substrate for a known glycosylase. That is, it may be treated by a base-specific base modification enzyme, or treated by chemical conversion, etc. so as to produce an appropriate substrate. After the desired abasic sites are created, one carries out nanopore physically based sequencing, which will, as described below, generate a readily determined signal at the point in the sequence where the abasic sites exist.

In another embodiment, one may determine epigenetic modified cytosines in genomic DNA. If the modified base is 5-methylcytosine, which is a glycosylase substrate, one treats the DNA of interest with 5-methylcytosine glycosylase, causing removal of 5-methylcytosines from the sequence and leaving abasic sites in their place. Determination of the sequence and detection of the created abasic sites is carried out, again using nanopore physically based sequencing (e.g. MinION™ mobile DNA sequencer). Alternatively, or additionally, if one wishes to determine the presence of 5-hydroxymethylcytosine bases in the sequence, one can treat the sample with KRuO4 to convert the 5-hydroxymethylcytosine bases to 5-formylcytosine. Then, the sample may be treated with thymine DNA glycosylase to produce abasic sites at the bases of interest. Again the abasic sites are readily determined as to their location in the sequence of the sample, as described below, namely with current spikes higher than that generated by base-containing residues.

As will be described below, the present invention is a broadly applicable method for sequence specific detection of modified bases. Applications of the present methods include identification of epigenetic markers on cytosine; improvement of nanopore sequencing accuracy; biological monitoring of chemotherapeutic drugs; biological monitoring of environmental carcinogens, and so forth.

Examples are given here of experiments that generated a set of abasic sites in DNA at positions of T within the DNA strand. This DNA was then sequenced in the MinION Nanopore Sequencer and data analyzed for the presence of abasic sites. See FIG. 2 for an outline of the use of DNA glycosylases to create and detect abasic sites at predetermined points within a sequence. Similarly, FIG. 3 shows nanopore sequencing data obtained from Lambda phage DNA. The experimental details are given below.

EXAMPLE 1A Preparation of Modified and Abasic Polynucleotides

1. A 3.7 kb fragment of Lambda phage was amplified using PCR. PCR reactions were run under two conditions, a control (canonical dNTPs) and a PCR reaction that included the 4 canonical dNTPs plus dUTP.
2. The PCR products were purified, re-suspended in high purity water and quantified on a Nanodrop UV/VIS spectrophotometer. Two hundred nanograms of PCR products were run on a 0.8% agarose gel to confirm size and purity.
3. Two aliquots of 2 µg each, one from the control canonical dNTPs PCR reaction and one from the dUTP PCR reaction, were used to prepare sequencing libraries using ONT (Oxford Nanopore Technologies) standard procedures (ONT library preparation procedure 004).
4. One aliquot from each PCR reaction was treated for 15 min with Uracil-DNA glycosylase, followed by bead purification, immediately after end repair in the library preparation process. This resulted in the generation of 4 samples for sequencing. The 4 reactions have the following compositions.
    a. Control, i.e. 3.7 kb lambda fragment containing canonical bases ACGT, and no abasic sites.
    b. Control, i.e. 3.7 kb lambda fragment containing canonical bases ACGT and Uracil-DNA glycosylase treatment, with no abasic sites.
    c. Uracil PCR reaction containing 3.7 kb lambda fragment containing canonical bases ACGT plus uracil, with no abasic sites.
    d. Uracil PCR reaction containing 3.7 kb lambda fragment containing canonical bases ACGT plus uracil which was treated with Uracil-DNA glycosylase. Contains abasic sites at sequence positions with incorporation of dUTP.

EXAMPLE 1B Sequence Analysis including Abasic Sites and Sequence Coverage

Abasic sites in a polynucleotide are created at specific base locations, as described above. The polynucleotide, containing canonical bases and abasic sites, is then sequenced in a nanopore-based sequencing device. As is known, such a device can produce ionic current traces corresponding to translocation of a DNA strand through the nanopore; these can be divided into segments of ionic current, where the mean and variance in current for each segment depends on the identity of the bases proximal to the “reading head” of the nanopore. The “reading head” of the nanopore includes both the limiting aperture and adjacent areas of the nanopore where bases in the polynucleotide can interact with the nanopore to alter ionic current through the nanopore. The first step in analyzing a current trace (shown e.g. in FIG. 1B) is to segment it, using appropriate electronics and software. Five to five hundred segments per second are generated by the segmentation software analyzing the raw ionic current signal. The number of segments per unit time depends on the speed of the DNA motor (enzyme) modulating translocation of the DNA/polynucleotide through the nanopore.
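The segmentation step described above can be illustrated with a minimal sketch. It assumes a simple rule in which a new segment begins whenever a sample departs from the running mean of the current segment by more than a fixed amount; the function name segment_trace, the 5 pA jump value and the sample trace are hypothetical and are not taken from the sequencer software.

```python
from statistics import mean, pvariance

def segment_trace(samples, jump_pa=5.0):
    """Split current samples (pA) into segments.

    A new segment starts when a sample differs from the running mean of
    the current segment by more than jump_pa (a hypothetical threshold).
    Returns a list of (mean, variance, n_samples) tuples.
    """
    segments = []
    current = []
    for value in samples:
        if current and abs(value - mean(current)) > jump_pa:
            segments.append((mean(current), pvariance(current), len(current)))
            current = []
        current.append(value)
    if current:
        segments.append((mean(current), pvariance(current), len(current)))
    return segments

# Illustrative (made-up) trace with four current levels.
trace = [60.1, 59.8, 60.3, 72.0, 71.6, 72.2, 101.5, 100.9, 64.0, 63.7]
for seg_mean, seg_var, n in segment_trace(trace):
    print(f"mean={seg_mean:.1f} pA  var={seg_var:.2f}  n={n}")
```

Each tuple carries the mean and variance that the text identifies as the per-segment quantities used in later base identification.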

The next step in data analysis after segmenting an ionic current trace is to determine which bases are present at the reading head of the nanopore for each ionic current segment. Currently for the MinION™ nanopore sequencer, these ionic current segments depend on the 5 contiguous bases proximal to the reading head of the nanopore. A lookup table containing all 1024 possible combinations of 5 bases is used to identify these 5 base long blocks or ‘words’ for sequence determination as they translocate through the nanopore. That is to say, the polynucleotide is detected by the nanopore as 5 base long words that move through the nanopore in 1 base steps. Accordingly, a single base position alters ionic current when it is within 5 bases or less of the reading head of the nanopore. Additionally, each base position in a polynucleotide influences 5 contiguous ionic current segments. For analysis of polynucleotides of canonical base sequence, hidden Markov model (HMM) based software is used to translate information from segmented current data into DNA sequences. An exemplary hidden Markov model is further described in Sjolander, “Method and apparatus using Bayesian subfamily identification for sequence analysis,” U.S. Pat. No. 6,128,587. HMMs can be constructed to identify sets of positions that describe the (more or less) conserved first-order structure in a set of sequences. In biological terms, this corresponds to identifying core elements of homologous molecules. HMMs can also provide additional information such as the probability of initiating an insertion at any position in the model, and the probability of extending it. The structure of an HMM is similar to that of a profile, with position-specific insert and delete probabilities. In constructing an HMM or profile for the subfamilies, information can be shared between subfamilies at positions where there is evidence of common structural constraints.
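The 5-base word model above can be made concrete with a short sketch. It enumerates the 1024 possible words and shows which successive words contain a given base position; the names ALL_KMERS, kmer_windows and windows_containing and the example sequence are illustrative assumptions only.

```python
from itertools import product

BASES = "ACGT"
K = 5

# All possible 5-base words over the four canonical bases: 4**5 entries.
ALL_KMERS = ["".join(p) for p in product(BASES, repeat=K)]

def kmer_windows(sequence, k=K):
    """Successive k-base words read as the strand advances one base at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def windows_containing(sequence, position, k=K):
    """Indices of the k-base windows that include the base at `position` (0-based)."""
    first = max(0, position - k + 1)
    last = min(position, len(sequence) - k)
    return list(range(first, last + 1))

seq = "ACGTACGTAC"  # arbitrary example sequence
print(len(ALL_KMERS))               # size of the lookup table
print(kmer_windows(seq))            # words seen in 1 base steps
print(windows_containing(seq, 4))   # windows influenced by the base at index 4
```

An interior base falls in 5 consecutive windows, which is why one base position influences 5 contiguous ionic current segments; bases near the ends of a strand fall in fewer.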

The HMM for canonical base sequence determinations typically contains 1024 states, one for each possible 5-mer. An HMM for analysis of abasic sites will include additional states for 5-mers containing abasic sites. The software as described here takes advantage of characteristic current increases when the bases of interest are converted to an abasic site. These changes in ionic current permit determination of the location of modified bases of interest. The UCSC Nanopore group has previously developed software to detect cytosine modifications within otherwise canonical DNA sequences (Schreiber et al, Proc Natl Acad Sci U S A. 2013, 110:18910-5; Wescoe et al, J Am Chem Soc. 2014, 136:16582-7). These papers, combined with the present description, enable the creation of a custom bioinformatics package to develop the aforementioned software. The combination of the described method, software, and nanopore-based sequencing can be used to perform de novo calling and identification of bases (canonical or modified).
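As a rough illustration of how the state space grows when abasic sites are added, the following sketch counts 5-mers over a five-symbol alphabet. It assumes that an abasic site may occupy any position of the 5-mer and uses the hypothetical symbol X for it; an actual model might include only a subset of these states.

```python
from itertools import product

CANONICAL = "ACGT"
ABASIC = "X"  # hypothetical symbol standing in for an abasic site

def kmer_states(alphabet, k=5):
    """Enumerate every k-mer over the given alphabet."""
    return ["".join(p) for p in product(alphabet, repeat=k)]

canonical_states = kmer_states(CANONICAL)
extended_states = kmer_states(CANONICAL + ABASIC)
abasic_states = [s for s in extended_states if ABASIC in s]

print(len(canonical_states))  # states of the canonical model
print(len(abasic_states))     # upper bound on additional abasic-containing states
```

Under this assumption the additional states number 5**5 - 4**5, in addition to the 1024 canonical states.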

EXAMPLE 2 Proof that Significant Differences are Obtained from Abasic Sites in Sequence

Referring now to FIGS. 4A-4D, the distributions of mean currents for segments from base calling events in the MINION™ sequencer for the four PCR reactions are shown. FIG. 4A shows the results of a control Lambda fragment; FIG. 4B shows current distributions for control Lambda fragment glycosylase treated; FIG. 4C shows current distributions for control Lambda fragment with dUTP; and FIG. 4D shows current distributions for control Lambda fragment with dUTP and glycosylase treatment. Current levels as shown in FIGS. 4A-4D were used to determine DNA base identity in MINION™ nanopore sequencing. Abasic positions are associated with a high current relative to positions within a DNA strand that contain a base. Currents in the range of approximately 100 pA are found to a significant degree only in the graph of FIG. 4D. Only the reaction in FIG. 4D is expected to contain abasic sites. Note the peak at high current levels revealed as a spike on the shoulder of the current level distribution circled in FIG. 4D and its absence in FIGS. 4A, 4B and 4C.

Referring now to FIGS. 5A and 5B, there is illustrated an example of data for a control polynucleotide and an abasic polynucleotide. FIG. 5A is a nanopore sequencing result (raw data) from a non-treated DNA expected to have no abasic sites. FIG. 5B shows similar DNA containing U and then treated with a Uracil DNA glycosylase to generate abasic sites. Shown are data from DNA strands processed in MINION™ nanopore sequencing runs. Depicted in panels 5A and 5B are MINION™ raw data current traces. Panel 5A is from a control lambda DNA strand containing the 4 canonical DNA bases (ACGT) and no abasic sites. Panel 5B is from a lambda DNA strand that also has Uracil incorporated into it, and was subsequently treated with Uracil DNA glycosylase to generate abasic sites at positions of T within the sequence. In panel 5A, current traces corresponding to the canonical sequence of the lambda strand do not exceed 90 pA current (straight line at about 90 pA). The high current spikes at time equals 3820 seconds and approximately 3822 seconds are sequencer imposed currents. Similarly the high current seen at time equals 3852 seconds is from the adaptor used in library preparation of DNA for sequencing. In panel 5B, the current trace for a lambda DNA strand containing abasic sites, current is seen to exceed 90 pA at positions of T within the sequence, marking abasic sites.
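The observation that only abasic positions, sequencer imposed spikes and the adaptor exceed about 90 pA suggests a simple thresholding rule, sketched below. The function name, the `excluded` parameter and the segment values are hypothetical; the 90 pA default is taken from the traces described above and would need to be set for a given pore and run.

```python
def flag_abasic_segments(segment_means_pa, threshold_pa=90.0, excluded=()):
    """Return indices of segments whose mean current exceeds threshold_pa.

    `excluded` lists segment indices already attributed to sequencer
    imposed spikes or to the library adaptor, which also exceed the
    threshold but do not mark abasic sites.
    """
    skip = set(excluded)
    return [
        i for i, level in enumerate(segment_means_pa)
        if level > threshold_pa and i not in skip
    ]

# Made-up segment means (pA); index 2 is treated as an adaptor signal.
means = [62.0, 71.5, 120.0, 68.2, 101.3, 74.9, 97.8, 66.0]
print(flag_abasic_segments(means, excluded=[2]))
```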

EXAMPLE 3 Creation of Abasic Sites in Place of 5-Methylcytosine

FIG. 6 shows one of several best ways of creating abasic sites, in particular for epigenetic sequencing using the MINION™ nanopore device. Shown there are two approaches to sequencing DNA to determine positions of the epigenetic marker 5-methylcytosine. In approach I, a 5-methylcytosine specific DNA glycosylase is used to generate abasic sites at base positions originally containing 5-methylcytosine. The position of the abasic site is determined by the nanopore sequencer and confirmation of the correct position is made by sequence on the opposite strand, i.e. of a G base which paired with the 5-methylcytosine before glycosylase treatment. In approach II, the 5-methylcytosine is first enzymatically deaminated to thymine by 5-methylcytosine deaminase. Subsequent treatment with G/T(U)-mismatch DNA glycosylase removes the thymine, generating an abasic site for detection by the MINION™ nanopore sequencer. The example sequences shown have the sequences

ACCCGTAACabasicGGATTCGCGTAAA (approach I, strand with abasic site; SEQ ID NO: 5), ACCCGTAACTGGATTCGCGTAAA (approach II, strand with T/G mismatch; SEQ ID NO: 6), and TTTACGCGAATCCGGTTACGGGT (complementary strand in both approaches; SEQ ID NO: 7).
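The confirmation step of approach I, in which the base opposite the abasic site is read from the complementary strand, can be sketched as follows using the example sequences above. The symbol X for the abasic site and the function names are assumptions made for illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def base_opposite_abasic(read_with_abasic, complementary_strand, abasic="X"):
    """Infer the canonical identity of each removed base from the opposite strand.

    Returns {position: base}, with positions 0-based on the abasic strand.
    """
    expected = reverse_complement(complementary_strand)
    if len(expected) != len(read_with_abasic):
        raise ValueError("strands must have the same length")
    return {i: expected[i] for i, b in enumerate(read_with_abasic) if b == abasic}

abasic_strand = "ACCCGTAACXGGATTCGCGTAAA"  # X stands in for the abasic site
complement = "TTTACGCGAATCCGGTTACGGGT"     # complementary strand
print(base_opposite_abasic(abasic_strand, complement))
```

For these sequences the complementary strand carries a G opposite the abasic site, so the removed base is inferred to be a cytosine, consistent with an original 5-methylcytosine at that position.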

EXAMPLE 4 Creating Abasic Sites to Improve Analysis of Regions of Sequencing Difficulty

Various sequencing processes encounter difficulty with regions of repetitive bases. It is difficult to determine the number of bases in homo-polymeric runs or short repeats. FIG. 7A shows one of several best ways of improving sequence accuracy, specifically, in this example, for determining homo-polymeric runs of Ts. In this example, one uses initial replication of the DNA to be sequenced using a high fidelity polymerase and a mixture of the 4 canonical bases (ACGT) and dUTP. DNA is then treated with a Uracil DNA glycosylase and a sequencing library prepared from this material. The sequences for individual DNA strands are then aligned with each other, noting positions of abasic sites. Contiguous blocks of abasic positions are identified and used to determine the length of homo-polymeric runs of Ts and As. In the figure, only bases in the homopolymeric stretch are indicated with explicit bases; the other base pairs are represented with vertical lines connecting the two strands. The steps may also be summarized as follows:

Step 1 Use DNA polymerase to replicate the DNA sequence using dATP, dCTP, dGTP, dTTP and dUTP
Step 2 Use Uracil DNA glycosylase to create abasic sites at sites of incorporated Uracil
Step 3 Determine base positions of abasic sites using the MinION
Step 4 Align sequences and identify contiguous blocks of abasic sites in pile ups
Step 5 Homopolymeric runs of Ts are identified based on abasic positions in the consensus sequence
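Steps 4 and 5 can be sketched as follows, assuming reads that are already aligned to equal length and in which abasic positions are written as X. Because dTTP and dUTP are both present during replication, any single read carries abasic sites at only some T positions, so the sketch pools reads column by column. The function names, the 0.2 minimum fraction and the example reads are hypothetical.

```python
from itertools import groupby

def abasic_column_mask(aligned_reads, abasic="X", min_fraction=0.2):
    """Mark each alignment column where enough reads show an abasic site."""
    n_reads = len(aligned_reads)
    length = len(aligned_reads[0])
    return [
        sum(read[col] == abasic for read in aligned_reads) / n_reads >= min_fraction
        for col in range(length)
    ]

def contiguous_runs(mask):
    """Return (start, length) for each contiguous block of marked columns."""
    runs, pos = [], 0
    for flag, group in groupby(mask):
        n = len(list(group))
        if flag:
            runs.append((pos, n))
        pos += n
    return runs

# Made-up pile up over a region with a run of four Ts at columns 3 to 6.
reads = [
    "ACGXTXTGCA",
    "ACGTXTXGCA",
    "ACGXXTTGCA",
]
print(contiguous_runs(abasic_column_mask(reads)))
```

Each (start, length) pair gives the position and length of a candidate homopolymeric run in the consensus.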

A similar approach is detailed for homo-polymer runs of C's and G's in FIG. 7B. The steps may be summarized as:

Step 1 Copy DNA to be analyzed with 1-X cycles of PCR
Step 2 Treat with DNA Glycosylase engineered to remove cytosines and create abasic positions at cytosine
Step 3 Determine base positions of abasic sites using a nanopore sequencer. Determine Kmers containing abasic sites with nanopore sequencer
Step 4 Align sequences
Step 5 Integrate read information to determine length of homopolymeric run of Cs

These methods can also be used to improve sequencing accuracy outside of homopolymeric tracts.

As described below also in connection with FIG. 10, FIGS. 7C and 7D show a similar approach for detecting certain bases that can be replaced with bases that are glycosylase substrates. A complementary strand is made, and a hairpin is used to connect these two strands.

The ligated strand is then copied with substituted bases, e.g. U replacing T. The U's are removed by uracil-DNA glycosylase treatment. The sequencing thus produces a duplex molecule with one strand containing abasic sites and a complementary strand that can be used to confirm sequence data. FIG. 7C shows the step of ligating a DNA hairpin to one end of DNA and a sequencing adapter to the other end of DNA, followed by a step of copying both strands of the DNA with a DNA Polymerase using one or more modified base(s) (X) as a substitute for one or more of the 4 canonical bases (ACGT). In this example the modified base is uracil, which replaces T and basepairs with A. FIG. 7D shows the step of treating DNA with Uracil DNA Glycosylase. In FIGS. 7C and 7D, only bases in locations of replacement are indicated with explicit bases; the other base pairs are represented with vertical lines connecting the two strands. In addition, one may use the present methods to detect 5-methylcytosine (5mC) in genomic DNA (gDNA) with a nanopore with other base modification strategies:

Method 1:

5mC can be converted to 5-carboxylcytosine (5caC) via an oxidation reaction using the enzyme Tet1 (commercially available as part of a kit: http (colon slash slash)/www(dot) wisegeneusa.com/#!Tet1/c12zy). Thymine DNA glycosylase (TDG) will then excise the resulting 5caC, thus creating an abasic site that can be detected using a nanopore and mapped using custom-designed software.

Method 2:

Bisulfite treatment of DNA results in conversion of a canonical Cytosine (C) to a Uracil (U) which can then be converted to an abasic residue using Uracil DNA glycosylase. This process does not affect 5mC. However, 5mC can be converted to C by treating DNA with a demethylase enzyme (like this one available as part of a kit: http (colon slash slash) www(dot)epigentek.com/catalog/epiquik-dna-demethylase-activityinhibition-assay-ultra-kit-p-3440.html). To do this one would split the gDNA in two reactions.

For the first reaction, bisulfite treatment of gDNA will convert C's to U's, and the resulting U's are then converted to abasic sites using UDG. This will leave the 5mC's unaffected. The resulting abasic sites can be detected using a nanopore and mapped using custom-designed software. These sites will indicate the presence of C's in gDNA.

For the second reaction, first convert 5mC's in genomic DNA to C's using a demethylase, and then perform bisulfite treatment to convert all C's (original C's as well as the 5mC's that were converted) to U's. These U's are then converted to abasic sites using UDG. The abasic sites can be detected using a nanopore and mapped using custom-designed software. These sites will indicate the presence of C's as well as 5mC's in gDNA. Custom designed software that uses information from the two reactions will help discern the presence and location of 5mC's in gDNA.
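The comparison of the two reactions amounts to a set difference over abasic positions, as in the minimal sketch below. It assumes that abasic positions from both reactions have already been mapped to the same reference coordinates; the function name and the example positions are hypothetical.

```python
def infer_5mc_positions(abasic_reaction_1, abasic_reaction_2):
    """Combine abasic positions from the two reactions of Method 2.

    Reaction 1 (bisulfite, then UDG): abasic sites mark unmethylated C.
    Reaction 2 (demethylase, bisulfite, then UDG): abasic sites mark C and 5mC.
    Positions seen only in reaction 2 are inferred to be 5mC.
    """
    r1, r2 = set(abasic_reaction_1), set(abasic_reaction_2)
    return {
        "C": sorted(r1),
        "5mC": sorted(r2 - r1),
        # Sites in reaction 1 but not reaction 2 are not predicted by the scheme.
        "unexpected": sorted(r1 - r2),
    }

calls = infer_5mc_positions([3, 17, 42], [3, 9, 17, 42, 58])
print(calls["5mC"])
```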

EXAMPLE 5 Improving RNA Sequencing by Creating Abasic Sites at Specific Points

FIG. 8 illustrates a method for improving RNA sequencing using abasic sites. Briefly, an RNA molecule can be reverse transcribed to form a cDNA-RNA hybrid. Nucleotide incorporation is performed using dNTP's, but with dUTP substituting for dTTP. The resulting cDNA that contains dUTP can be then treated by Uracil DNA Glycosylase to form abasic sites that can be detected using a nanopore-based sequencer that will provide further data on the RNA sequence, which will be sequenced both directly as RNA and as a complementary treated DNA strand that contains abasic sites at known locations. RNA that may be sequenced includes but is not limited to mRNA (messenger RNA), tRNA (transfer RNA), and rRNA (ribosomal RNA). “X” in this example represents dUTP incorporation in DNA complementary to A in the RNA but may also be other substitutions which can be removed by a specific DNA glycosylase. The example sequences shown have the sequences UCCCGUGGCCGUAGGCGCGUCGG (RNA strand to be sequenced; SEQ ID NO: 8), CCGACGCGCCXACGGCCACGGGA (cDNA strand with non-standard base “X”, uracil in the present example; SEQ ID NO: 9), and CCGACGCGCCabasicACGGCCACGGGA (cDNA strand with abasic site; SEQ ID NO: 10).
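The relationship between the example RNA strand and its cDNA can be reproduced with a short sketch. It builds the reverse complement of the RNA and writes X wherever the cDNA pairs with A in the RNA, following the convention of the example; the function names and the "-" symbol for an abasic site are assumptions.

```python
# cDNA base incorporated opposite each RNA base; X marks dUTP used in place of dTTP.
CDNA_OPPOSITE_RNA = {"A": "X", "U": "A", "G": "C", "C": "G"}

def cdna_with_substitution(rna):
    """cDNA strand (5' to 3') complementary to an RNA written 5' to 3'."""
    return "".join(CDNA_OPPOSITE_RNA[b] for b in reversed(rna))

def after_glycosylase(cdna, substituted="X", abasic="-"):
    """Replace each substituted base with an abasic-site symbol."""
    return cdna.replace(substituted, abasic)

rna = "UCCCGUGGCCGUAGGCGCGUCGG"
cdna = cdna_with_substitution(rna)
print(cdna)
print(after_glycosylase(cdna))
print([i for i, b in enumerate(cdna) if b == "X"])  # abasic positions (0-based)
```

The computed cDNA matches the strand given above with the non-standard base X, and the single abasic position corresponds to the single A in the RNA strand.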

EXAMPLE 6 Epigenetic Analysis of Base Modifications

There is presently a need for determining modifications that are not attributable to changes in the primary DNA sequence. Epigenetic modifications play a crucial role in gene expression, and thereby underpin the development, regulation, and maintenance of the normal cell. A commonly studied epigenetic modification is the methylation of cytosine (C) nucleotides in the context of a CpG dinucleotide. Historically, restriction enzymes have been used as one method to detect DNA methylation. The present methods can yield simultaneous sequencing and base modification determinations in a defined work flow using base enzymatic modification and nanopore-based sequencing.

Shown in FIG. 9 are structures of cytosine and modified cytosines. A sample under study can be treated with a series of enzymatic modifications to produce new bases in the sequence; these new bases are substrates for known glycosylases which then produce abasic sites that are analyzed as described above. Shown below is a work flow for detecting and distinguishing cytosine, 5-methylcytosine (5mC), 5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC) and 5-carboxycytosine (5caC) in a single polynucleotide.

To detect a variety of modified cytosine bases, one treats aliquots of DNA to be sequenced as follows:

-   5mC→5-methylcytosine DNA glycosylase→abasic site; or
-   5mC→convert to T/U mismatch using 5-methylcytosine deaminase→G/T(U) mismatch DNA glycosylase→abasic site
-   5hmC+KRuO4→5fC→thymine DNA glycosylase→abasic site
-   5fC→thymine DNA glycosylase→abasic site
-   5caC→thymine DNA glycosylase→abasic site.

The use of KRuO₄ (potassium perruthenate) and the use of this chemical for conversion of 5-hydroxymethylcytosine to 5-formylcytosine are further described in “Quantitative Sequencing of 5-Methylcytosine and 5-Hydroxymethylcytosine at Single-Base Resolution,” Booth et al. Science 336 (6083) 934-937 (May 2012) and “Oxidative bisulfite sequencing of 5-methylcytosine and 5-hydroxymethylcytosine,” Booth et al., Nat. Protocol. 8(10): 1841-1851 (October 2013).

As above, the abasic sites are in predetermined points in the polynucleotide sequence, and can be readily determined by the sequencing signal from an abasic site.
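The work flow of this example can be held in a small lookup structure so that the treatment for each aliquot is selected by the modified cytosine of interest. The sketch below simply transcribes the treatments listed in this example; the names WORKFLOWS and treatment_options are hypothetical.

```python
# Treatment steps per modified cytosine, transcribed from the work flow in
# this example. Each entry lists one or more alternative routes to an abasic site.
WORKFLOWS = {
    "5mC": [
        ["5-methylcytosine DNA glycosylase"],
        ["5-methylcytosine deaminase", "G/T(U) mismatch DNA glycosylase"],
    ],
    "5hmC": [["KRuO4", "thymine DNA glycosylase"]],
    "5fC": [["thymine DNA glycosylase"]],
    "5caC": [["thymine DNA glycosylase"]],
}

def treatment_options(modified_base):
    """Return the alternative treatment routes listed for a modified cytosine."""
    try:
        return WORKFLOWS[modified_base]
    except KeyError:
        raise ValueError(f"no work flow listed for {modified_base!r}") from None

for route in treatment_options("5mC"):
    print(" -> ".join(route + ["abasic site"]))
```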

EXAMPLE 7 5-Methylcytosine Detection with Sequencing of Template and Complement Strands using a DNA Hairpin and Strand Replication with a DNA Polymerase

As described above, the present methods may be used to detect epigeneticmodifications such as 5-methylcytosine. FIG. 9 (discussed above) showsdifferent modified cytosines that can be treated and identified usingthe present methods. In this example, DNA to be sequenced is copied by aDNA polymerase to provide a canonical copy and a copy with abasicpositions at sites of 5mC all in one contiguous DNA strand (FIG. 10) fornanopore sequencing. The first step involves ligation of a primer/DNAbarcode to one end of the DNA molecule and a low melting DNA hairpin tothe other end. The primer/DNA barcode provides a primer sequence forcopying the DNA strands and the hairpin provides a mechanism for keepingthe original template and complement strands attached to each otherduring enzymatic copying. The DNA is copied using a single round of PCRto generate strand 1 paired with strand 2 (shown in FIG. 10). Strand 1(top strand in FIG. 10) is the original template and complementarystrands to be sequenced, connected by the linearized hairpin sequence.The hairpin is illustrated as appearing between the template andcomplementary strands. Each end of the construct consists of aprimer/DNA barcode (sequencing adapter sequence). Strand 2 (shownbeneath and complementary to strand 1) is a copy of the original duplexmolecule and does not contain modified bases, due to the bases used forthe copying step. The complex is treated with 5-methylcytosineglycosylase or 5-methycytosine deaminase and G/t(U)-mismatch DNAglycosylase to create abasic sites in only the original strands atpositions of 5-methylcytosine. This results in abasic sites at5-methylcytosine positions in strand 1 and a copy strand containing onlycanonical bases in strand 2. Two abasic sites are shown in this example.Only bases in locations where abasic sites are generated are illustratedhere; other base pairs are represented with vertical lines connectingthe two strands. 
The samples may be prepared using standard nanopore sequencer commercial library kits that use a hairpin to connect Strands 1 and 2 together for contiguous sequencing of both DNA strands. A nanopore sequencer and sequencing software described above will then determine the positions of the abasic sites in the original DNA strand, and confirm base identity by comparing the base sequence of the copied strand with that of the original template and complement strands, in which the modified bases have now been converted to abasic positions.
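By way of illustration only, the comparison described in this example can be sketched in Python as follows. The sketch assumes, hypothetically, that the base caller reports an abasic position as the symbol "X", that the reads of strand 1 and strand 2 are already trimmed to the same region and contain no insertions or deletions, and that strand 2 is read 5' to 3'. The function names are illustrative and are not part of any sequencer software.

```python
# Illustrative sketch only; the "X" symbol and all names are assumptions.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
ABASIC = "X"  # hypothetical symbol for a base call at an abasic site


def reverse_complement(seq):
    """Return the reverse complement of a canonical DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def call_abasic_sites(strand1_read, strand2_read, expected_base="C"):
    """Locate abasic sites in strand 1 and infer the original base from strand 2.

    Returns a list of (position, inferred_base, matches_expected) tuples,
    with 0-based positions along strand 1.
    """
    # Strand 2 is complementary to strand 1, so its reverse complement
    # gives the canonical sequence of strand 1.
    inferred = reverse_complement(strand2_read)
    if len(inferred) != len(strand1_read):
        raise ValueError("reads must cover the same region with no indels")
    calls = []
    for position, base in enumerate(strand1_read):
        if base == ABASIC:
            original = inferred[position]
            calls.append((position, original, original == expected_base))
    return calls


strand1 = "ATXGGAXT"  # toy read of the original strands after glycosylase treatment
strand2 = "AGTCCGAT"  # toy read of the canonical copy, 5' to 3'
print(call_abasic_sites(strand1, strand2))
# prints [(2, 'C', True), (6, 'C', True)]
```

In this toy input, both abasic positions align with a C in the canonical copy, which is consistent with 5-methylcytosine at those positions in the original strand.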

EXAMPLE 8 Use of DNA Glycosylases and Corresponding Modified Bases Detected

Table 1, extracted from Krokan, Standal, and Slupphaug (1997), lists a variety of glycosylases that can be used to create abasic sites at predetermined sites in a polynucleotide sequence and thus identify a wide variety of modified bases in a polynucleotide. The column labeled “Beta-Lyase Activity” indicates whether the enzyme in column 1 also cleaves the polynucleotide backbone. The enzyme may be engineered to remove this activity while retaining the base cleavage activity shown in FIG. 1.

TABLE 1
Enzyme | Source/Gene | Reported DNA Modified Base Substrates | Beta-Lyase Activity
Uracil-DNA Glycosylase | Viral | Uracil | No
 | Bacterial (UNG) | 5-Fluorouracil, Isodialuric Acid, 5-Hydroxyuracil | No
 | Yeast (S. cerevisiae) (UNG1) | Uracil | No
 | Plants | Uracil | No
 | Human (UNG) | 5-Fluorouracil, Alloxan, 5-Hydroxyuracil | No
G/T(U) mismatch-DNA Glycosylase | M. thermoautotrophicum | G/G, A/G, T/C, U/C | No
 | Insects | Uracil mismatch | No
 | Human | T and U mismatch | No
Alkylbase-DNA Glycosylases | E. coli (tag) | 3-methyl guanine | No
 | E. coli (alkA) | O²-Alkylcytosine, 5-formyluracil, 5-hydroxymethyluracil, hypoxanthine, N⁶-ethenoadenine, N⁴-ethenocytosine | No
 | S. cerevisiae (MAG) | 7-chloroethyl-guanine, hypoxanthine, N⁶-ethenoadenine | No
 | S. pombe (mag1) | 3-Methyladenine | Unk.
 | A. thaliana (MPG) | 3-Methyladenine | Unk.
 | Rodent/Human (MPG) | 7-chloroethyl-guanine, 8-oxoguanine, hypoxanthine, N⁶-ethenoadenine | No
5-Methylcytosine-DNA Glycosylase | Chick Embryo | T in G/T mismatch | No
 | Chick Embryo | 5-methylcytosine | No
Thymine DNA Glycosylase | Human (TdG) | 5-formylmethylcytosine (5fC) | No
 | Human (TdG) | 5-carboxylcytosine (5caC) | No
Adenine-specific mismatch DNA Glycosylases | E. coli (mutY) | A in G/A and C/A | Yes/No
 | Bovine, Human (MYH) | A in G/A and C/A, also 8-oxoguanine | Yes
DNA Glycosylases removing oxidized pyrimidines (EndoIII-like) | E. coli EndoIII (nth) | 5-hydroxycytosine, 5,6-Dihydrothymine, 5-Hydroxy-5,6-dihydrothymine, Thymine glycol, Uracil glycol, Alloxan, 5,6-Dihydroxyuracil, 5-Hydroxy-5,6-dihydroxyuracil, 5-Hydroxyuracil, 5-Hydroxyhydantoin | Yes
 | S. cerevisiae (NTG1) | 2,5-Amino-5-formamidopyrimidine, 4,6-Diamino-5-formamidopyrimidine, 2,6-Diamino-4-hydroxy-5-formamidopyrimidine, Thymine glycol | Yes
 | S. pombe (nth) | Thymine glycol, 5-Hydroxy-uracil | Yes
 | Bovine/human EndoIII | Thymine glycol | Yes
EndoVIII | E. coli | 5,6-Dihydrothymine, Thymine glycol | Yes
EndoIX | E. coli | Urea | Unk.
Hydroxymethyl-DNA glycosylase | Mouse | Uracil | No
 | Bovine | 5-hydroxymethyluracil | Unk.
Formyluracil-DNA glycosylase | Human | 5-formyluracil | Unk.
DNA glycosylases removing oxidized purines | E. coli (fpg) | 8-oxoguanine, 2,5-Amino-5-formamidopyrimidine, 4,6-Diamino-5-formamidopyrimidine, 2,6-Diamino-4-hydroxy-5-formamidopyrimidine | Yes
 | S. cerevisiae (OGG1) | 8-oxoguanine (opposite T) | Yes
 | S. cerevisiae (OGG2) | 8-oxoguanine (opposite A) | Yes
 | D. melanogaster S3 | 8-oxoguanine | Yes
Pyrimidine-dimer-DNA glycosylases | T4 | 4,6-Diamino-5-formamidopyrimidine | Yes
 | M. luteus | Cyclobutane-pyrimidine dimer | Yes
 | N. mucosa | | Yes
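As a non-limiting illustration, a subset of Table 1 can be encoded as a small lookup so that candidate enzymes for a given modified base are selected and those with reported or unknown beta-lyase activity are set aside. The record layout, the function name `candidates`, and the choice of rows are assumptions made for this sketch; the substrate and activity values are copied from Table 1.

```python
from typing import NamedTuple, Optional, Tuple


class Glycosylase(NamedTuple):
    enzyme: str
    source: str
    substrates: Tuple[str, ...]
    beta_lyase: Optional[bool]  # None stands for "Unk." in Table 1


# Hand-picked subset of Table 1 rows (illustrative, not exhaustive).
TABLE_1_SUBSET = [
    Glycosylase("Uracil-DNA glycosylase", "Human (UNG)",
                ("5-fluorouracil", "alloxan", "5-hydroxyuracil"), False),
    Glycosylase("Alkylbase-DNA glycosylase", "E. coli (alkA)",
                ("5-formyluracil", "5-hydroxymethyluracil", "hypoxanthine"), False),
    Glycosylase("Alkylbase-DNA glycosylase", "Rodent/Human (MPG)",
                ("7-chloroethyl-guanine", "8-oxoguanine", "hypoxanthine"), False),
    Glycosylase("5-Methylcytosine-DNA glycosylase", "Chick embryo",
                ("5-methylcytosine",), False),
    Glycosylase("Thymine DNA glycosylase", "Human (TdG)",
                ("5-carboxylcytosine",), False),
    Glycosylase("Formyluracil-DNA glycosylase", "Human",
                ("5-formyluracil",), None),
    Glycosylase("DNA glycosylase removing oxidized purines", "E. coli (fpg)",
                ("8-oxoguanine",), True),
]


def candidates(target_base, require_no_lyase=True, table=TABLE_1_SUBSET):
    """Return table rows whose reported substrates include target_base.

    With require_no_lyase=True, only rows reported as lacking beta-lyase
    activity are kept; rows marked "Yes" or "Unk." are excluded.
    """
    wanted = target_base.lower()
    hits = [row for row in table if wanted in row.substrates]
    if require_no_lyase:
        hits = [row for row in hits if row.beta_lyase is False]
    return hits


for row in candidates("8-oxoguanine"):
    print(row.enzyme, "/", row.source)
# prints Alkylbase-DNA glycosylase / Rodent/Human (MPG)
```

Passing `require_no_lyase=False` also returns the bi-functional entries, which would be candidates for the engineering approach of Example 9.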

A more recent list of glycosylase enzymes is included in Table 2.

TABLE 2 DNA Glycosylases. Extracted from “Recent advances in the structural mechanisms of DNA glycosylases,” Brooks et al. 2013.
Gene Symbol | Source
OGG1 | Eukaryotes, Archaea, Prokaryotes
OGG2 | Eukaryotes
AGOC | Archaea
MutM/Fpg | Prokaryotes
NTH1 | Eukaryotes
EndoIII | Archaea
Nth/EndoIII | Prokaryotes
NEIL1 | Eukaryotes
Nei/EndoVIII | Prokaryotes
NEIL2 | Eukaryotes
NEIL3 | Eukaryotes
AAG | Eukaryotes
MAG1 | Eukaryotes
AfAlkA | Archaea
MpgII | Archaea
AlkA | Prokaryotes
MagIII | Prokaryotes
TAG | Prokaryotes
AlkC | Prokaryotes
AlkD | Prokaryotes
MUTYH | Eukaryotes
MutY | Archaea
MutY | Prokaryotes
UDG | Eukaryotes
Ung | Archaea
UDG-1 | Prokaryotes
SMUG | Eukaryotes
UDG-3 | Prokaryotes
TDG | Eukaryotes
MUG | Archaea
UDG-2 | Prokaryotes
UDG | Archaea
UDG-4 | Prokaryotes (Thermus thermophilus)
MBD4 | Eukaryotes
MIG | Archaea
DME | Eukaryotes
ROS1 | Eukaryotes
DML2 | Eukaryotes
DML3 | Eukaryotes

EXAMPLE 9 Engineering of DNA Glycosylases Including Removal of their Beta-Lyase Activity and Changing Specificity

It is preferred that the glycosylases used herein do not contain beta lyase activity, i.e. do not cleave the strand when excising the base to be removed. The DNA glycosylases listed above can be modified by routine experimentation to lack the beta lyase activity. This is described, e.g. in:

“Recent advances in the structural mechanisms of DNA glycosylases,” Brooks, Sonja C.; Adhikary, Suraj; Rubinson, Emily H.; and Eichman, Brandt F. Biochimica et Biophysica Acta-Proteins and Proteomics Volume: 1834 Issue: 1 Pages: 247-271 Published: January 2013;

“Base Excision Repair,” Krokan, Hans E.; Bjoras, Magnar. Cold Spring Harbor Perspectives in Biology 5(4) Number: a012583, April 2013;

“DNA glycosylases in the base excision repair of DNA,” Krokan, H E; Standal, R; Slupphaug, G. Biochemical Journal Volume: 325 Pages: 1-16 Part: 1 Published: Jul. 1 1997.

Briefly, the glycosylase of interest will be part of one of two classes of enzymes, having known mechanisms of activity. The first class is mono-functional and cleaves the base from the sugar phosphate backbone by cleavage of the N-glycosidic bond to yield a free base and an abasic site in the DNA or RNA strand. The second class, the bi-functional glycosylases, has both base cleavage activity and beta-lyase activity (cleavage of the sugar phosphate backbone). The above papers describe glycosylases in general and include information needed for engineering the second class of enzymes to eliminate beta lyase activity and in turn make these enzymes more useful in nanopore sequencing of nucleic acids with modified bases. The bi-functional enzymes are believed to involve an intermediate in which the sugar phosphate backbone is covalently linked to the glycosylase, whereas the mono-functional glycosylase class does not proceed using this mechanism. Additionally, the mechanism for generation of the nucleophilic intermediate responsible for base cleavage differs between these two classes of glycosylases. The amino acid residues involved in generation of a nucleophile for base cleavage (as well as covalent attachment of enzyme to sugar phosphate backbone) are different between the bi-functional glycosylases and the mono-functional glycosylases. Our approach is to engineer the bi-functional glycosylases using site directed mutagenesis, to change those amino acid residues involved in nucleophile generation for base cleavage, in covalent attachment of the bi-functional glycosylase to the sugar phosphate backbone as an intermediate in the reaction mechanism, or in both.

In addition, a known enzyme can be altered to change its substrate. In site directed mutagenesis of human uracil DNA glycosylase, conversion of Tyr147 to Ala147 resulted in the human UDG cleaving both uracil and thymine. This could be a method for improving sequence accuracy by direct treatment of the DNA with this mutant glycosylase followed by sequencing (which would allow resolution of homopolymer tracts of T's and A's). Similarly, changing Asn204 to Asp204 results in the glycosylase cleaving both uracil and cytosine. This would also potentially improve sequencing accuracy. This is described in “Excision of cytosine and thymine from DNA by mutants of human uracil-DNA glycosylase,” Kavli, B; Slupphaug, G; Mol, C D; et al. EMBO Journal Volume: 15 Issue: 13 Pages: 3442-3447 Published: Jul. 1 1996.
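For illustration, the residue substitutions named above can be represented and applied to a protein sequence in silico, for example when designing mutagenesis constructs. The following Python sketch is an assumption-laden example: the function names are hypothetical, and `PLACEHOLDER` is a made-up sequence that merely has Tyr at position 147 and Asn at position 204. It is not the sequence of human uracil DNA glycosylase.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def parse_substitution(text):
    """Parse a substitution such as 'Tyr147Ala' into ('Y', 147, 'A')."""
    match = re.fullmatch(r"([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})", text)
    if match is None:
        raise ValueError(f"unrecognized substitution: {text!r}")
    wild_type, position, mutant = match.groups()
    if wild_type not in THREE_TO_ONE or mutant not in THREE_TO_ONE:
        raise ValueError(f"unknown residue name in {text!r}")
    return THREE_TO_ONE[wild_type], int(position), THREE_TO_ONE[mutant]


def apply_substitution(protein, text):
    """Return protein with the substitution applied (positions are 1-based).

    Raises ValueError if the wild-type residue is not found at that position.
    """
    wild_type, position, mutant = parse_substitution(text)
    if not 1 <= position <= len(protein) or protein[position - 1] != wild_type:
        raise ValueError(f"{text}: wild-type residue not found at position {position}")
    return protein[:position - 1] + mutant + protein[position:]


# Made-up placeholder: Tyr at position 147 and Asn at position 204.
PLACEHOLDER = "G" * 146 + "Y" + "G" * 56 + "N" + "G" * 10

mutant = apply_substitution(PLACEHOLDER, "Tyr147Ala")
print(PLACEHOLDER[146], "->", mutant[146])
# prints Y -> A
```

Checking the wild-type residue before substituting guards against numbering mismatches between a literature position and the sequence actually in hand.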

Conclusion

The above specific description is meant to exemplify and illustrate the invention and should not be seen as limiting the scope of the invention, which is defined by the literal and equivalent scope of the appended claims. Any patents or publications mentioned in this specification are intended to convey details of methods and materials useful in carrying out certain aspects of the invention which may not be explicitly set out but which would be understood by workers in the field. Such patents or publications are hereby incorporated by reference to the same extent as if each was specifically and individually incorporated by reference and contained herein, as needed for the purpose of describing and enabling the method or material referred to.

1-28. (canceled)
29. A method of detecting a sequence in an RNA polynucleotide molecule, the method comprising: (a) treating the RNA polynucleotide molecule with an RNA glycosylase that creates an abasic site corresponding to pseudouridine (Ψ), dihydrouridine (D), inosine (I), and 7-methylguanosine (m7g) species in the polynucleotide; (b) conducting single molecule sequencing on the polynucleotide prepared in step (a) where the sequencing indicates the abasic site within the polynucleotide sequence; and (c) using the sequence from step (b) to identify the abasic site and correlating said abasic site to pseudouridine (Ψ), dihydrouridine (D), inosine (I), and 7-methylguanosine (m7g).
30. The method of claim 29, wherein the RNA glycosylase is EC 3.2.2.22.
31. The method of claim 29, wherein the RNA glycosylase lacks beta lyase activity.
32. The method of claim 31, wherein the RNA glycosylase is engineered to lack beta lyase activity.
33. The method of claim 29, wherein conducting single molecule sequencing comprises nanopore-based sequencing comprises measuring an ionic current that identifies an abasic site.
34. The method of claim 33, wherein the nanopore-based sequencing includes detecting an ionic current through a nanopore through which the polynucleotide passes.
35. The method of claim 31, wherein conducting single molecule sequencing comprises nanopore-based sequencing comprises measuring an ionic current that identifies an abasic site.
36. The method of claim 35, wherein the nanopore-based sequencing includes detecting an ionic current through a nanopore through which the polynucleotide passes.
37. The method of claim 29, wherein the RNA polynucleotide molecule is mRNA.
38. The method of claim 29, wherein the RNA polynucleotide molecule is tRNA.
39. The method of claim 29, wherein the RNA polynucleotide molecule is genomic RNA.